Generalized Additive Models in Fraud Detection and Pattern Recognition

Data Science Capstone Project

Author

Grace Allen, Kesi Allen, Sonya Melton, Pingping Zhou

Published

October 7, 2025

Slides

Literature Review

Introduction

Generalized Additive Models (GAMs) have emerged as a powerful extension of traditional regression methods, offering a balance between predictive flexibility and interpretability. Originally introduced by Hastie & Tibshirani (1986) and Hastie & Tibshirani (1990), GAMs build on the framework of Generalized Linear Models (GLMs) by replacing the strictly linear predictor with a sum of smooth, data-driven functions. This structure allows models to capture complex nonlinear relationships while preserving interpretability, making them especially valuable in fields where transparency is critical, including finance, healthcare, auditing, and cybersecurity. Their ability to represent nonlinear effects in a way that stakeholders and regulators can directly review has positioned GAMs as an important tool in modern statistical and machine learning applications.

The foundations of GAMs are grounded in penalized likelihood estimation and iteratively reweighted least squares (HalDa, 2012), while modern implementations such as the mgcv package in R (Wood, 2017, 2025) have greatly improved their efficiency, scalability, and robustness. Penalization techniques introduced by Wood (2017) allow smoothness control, prevent overfitting, and address issues such as concurvity, making GAMs well-suited for noisy or high-dimensional datasets. These developments have made GAMs increasingly practical for real-world applications. Transparency also remains central: as Zlaoui (2018) illustrates, GAMs provide interpretable risk curves that visualize how each feature influences an outcome, offering critical insight in high-stakes environments.

Applications of GAMs across different fields underscore their versatility. In ecology, they have been used to map species distributions and detect environmental thresholds (Detmer, 2025; Guisan et al., 2002). In biostatistics, they have informed studies of health outcomes such as alcohol use (White et al., 2020). In finance and auditing, GAMs have uncovered irregular revenue patterns and detected fraudulent Medicare billing, with results that auditors and regulators could interpret directly (Brossart et al., 2015; Miller, 2025). Even in challenging contexts where noisy or uneven data reduce precision, studies have shown that recall and interpretability remain strong advantages of the approach (Detmer, 2025; Guisan et al., 2002; Tragouda et al., 2024).

Building on these foundations, researchers have proposed several extensions and innovations. Functional and Dynamic GAMs account for functional predictors and temporal dependencies, enhancing model flexibility for forecasting and time-series applications (DGAM, 2021; FGAM, 2015). Neural-inspired variants such as Neural Additive Models (Agarwal et al., 2021) and GAMformer (GAMformer, 2023) integrate deep learning techniques, improving computational efficiency and extending the ability of GAMs to model complex nonlinear data. Bayesian approaches provide clearer ways to quantify uncertainty and guide variable selection (Miller, 2025). Other tools such as Gam.hp (2020) strengthen transparency by quantifying predictor contributions. Furthermore, Microsoft’s Explainable Boosting Machine explored by Lou et al. (2012) adapts the GAM framework to include limited interactions, improving predictive performance while retaining interpretability.

Research also highlights the role of GAMs within broader fraud detection strategies. In financial contexts, Tragouda et al. (2024) applied GAMs to bank cheque fraud, demonstrating high recall (77.8%) even when data imbalance reduced precision. Brossart et al. (2015) used GAMs to identify fraudulent Medicare billing, showing that interpretability helped build auditor trust despite challenges with adapting to emerging patterns. Miller (2025) combined GAMs with ensemble models such as random forests to detect irregular revenue in financial statements, producing visualizations auditors could use directly. Beyond GAMs, graph-based frameworks have emerged as complementary approaches. For example, Chang et al. (2022) introduced Graph Neural Additive Networks (GNANs), extending GAMs to graph-structured data such as transaction networks and achieving 84.5% ROC-AUC in detecting suspicious users. Zhang et al. (2025) demonstrated that GAMs could model sequential features in telecom fraud detection but were often outperformed by graph neural networks (GNNs) when modeling complex relational data.

In parallel, other interpretable machine learning techniques continue to shape the fraud detection landscape. Hanagandi et al. (2023) applied regularized generalized linear models, including Ridge, Lasso, and ElasticNet, to highly imbalanced credit card fraud datasets, achieving strong performance (up to 98.2% accuracy with Ridge regression) and showing that careful preprocessing is essential for real-time fraud detection. Generative approaches also contribute: Zhu et al. (2023) demonstrated how Generative Adversarial Networks (GANs) can generate synthetic transaction data to improve robustness against class imbalance. Collectively, these innovations expand the interpretability-performance frontier and highlight how transparent modeling frameworks, including GAMs and their extensions, remain central to modern fraud analytics.

The primary objective of this analysis is to build and evaluate effective fraud detection models on the fraud detection transactions dataset using Generalized Additive Models (GAMs). Specifically, the goals are:

  • Develop Robust Models: Construct models that accurately distinguish between fraudulent and legitimate transactions using GAMs.

  • Identify Key Features: Pinpoint significant variables that contribute to fraud risk, improving interpretability and providing actionable insights for financial institutions.

  • Provide Practical Insights: Generate findings that enhance anomaly detection, risk management, and financial security strategies, while addressing challenges such as noise and class imbalance.

In this study, we apply GAM methodology using RStudio and the mgcv package to the Fraud Detection Transactions Dataset from Kaggle (Ashar, 2024). This synthetic yet realistic dataset provides an opportunity to test GAMs in a controlled but meaningful context. Our aim is to evaluate whether GAMs can balance predictive strength with interpretability, creating models that are both accurate and transparent for fraud detection.

Methods

Generalized Additive Models (GAMs) extend traditional regression by allowing flexible, nonlinear relationships between predictors and the response variable. In the context of fraud detection, GAMs model the probability that a transaction is fraudulent as a smooth and interpretable function of key predictors such as transaction amount, account activity, and time of day. Continuous variables are represented with spline-based smooth functions to capture nonlinear patterns, while categorical variables are incorporated as factors. The model is fitted using the mgcv package in R, which represents each smooth as a penalized regression spline and chooses the smoothing parameters with a criterion such as generalized cross-validation (GCV) or restricted maximum likelihood (REML), controlling smoothness and limiting overfitting (Wood, 2017). After fitting, the smooth terms illustrate how each variable influences fraud likelihood, enabling visual interpretation of complex effects. The fitted model is then applied to the held-out test set, where performance is evaluated using metrics such as AUC, accuracy, and recall.

The overall modeling process is summarized in the flow chart below, which outlines the key steps from data preparation through model evaluation and interpretation.

%%{init: {'theme': 'base', 'themeVariables': { 
  'background': '#FAFAF5',
  'primaryColor': '#4682B4',
  'secondaryColor': '#1E3A8A',
  'lineColor': '#1E3A8A',
  'nodeBorder': '#1E3A8A',
  'primaryTextColor': '#FFFFFF',
  'textColor': '#191970',
  'fontSize': '12px',
  'width': '100%'
}}}%%

flowchart TB
A["Data Preparation<br/>- Clean data<br/>- Encode categorical variables"] --> B["Exploratory Data Analysis<br/>- Check distributions<br/>- Identify predictors"]
B --> C["Split Data<br/>- Train/Test sets<br/>- Stratify by fraud outcome"]
C --> D["Specify GAM Model<br/>- Select predictors<br/>- Define smooth terms<br/>- Family = binomial"]
D --> E["Fit Model<br/>mgcv::gam(...)"]
E --> F["Evaluate Model<br/>- ROC/AUC<br/>- Confusion Matrix"]
F --> G["Interpret Results<br/>- Plot smooth effects"]
G --> H["Predict New Data<br/>- Apply model to test or new cases"]

style H fill:#FF4C4C,stroke:#8B0000,color:#FFFFFF
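
The steps in the flow chart can be sketched end to end in a few lines of R. The example below is only an illustration under stated assumptions: it runs on simulated data rather than the Kaggle file, the column names (amount, risk_score, device, fraud) and the auc() helper are hypothetical, and REML is chosen here as the smoothing criterion rather than taken from the study.

```r
library(mgcv)  # recommended package shipped with standard R installations

set.seed(42)
n <- 4000

# Simulated stand-in for the transaction data (hypothetical columns)
sim <- data.frame(
  amount     = rlnorm(n, meanlog = 4, sdlog = 1),
  risk_score = runif(n),
  device     = factor(sample(c("Laptop", "Mobile", "Tablet"), n, replace = TRUE))
)
sim$log_amount <- log(sim$amount)

# Assumed "true" fraud mechanism: nonlinear in risk_score, mild in log amount
eta <- -3 + 4 * sim$risk_score^2 + 0.3 * (sim$log_amount - 4)
sim$fraud <- rbinom(n, 1, plogis(eta))

# Stratified 70/30 split: sample row indices within each outcome class
train_idx <- unlist(lapply(split(seq_len(n), sim$fraud), function(idx) {
  sample(idx, size = floor(0.7 * length(idx)))
}))
train <- sim[train_idx, ]
test  <- sim[-train_idx, ]

# Binomial GAM: smooth terms for continuous predictors, a factor for device
fit <- gam(fraud ~ s(log_amount) + s(risk_score) + device,
           family = binomial, data = train, method = "REML")

# Predicted fraud probabilities on the held-out set
p_hat <- as.numeric(predict(fit, newdata = test, type = "response"))

# Rank-based (Mann-Whitney) AUC, written out to avoid extra dependencies
auc <- function(score, label) {
  r  <- rank(score)
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}
test_auc <- auc(p_hat, test$fraud)

# Confusion matrix at a 0.5 threshold (the threshold is an arbitrary choice)
conf_mat <- table(
  predicted = factor(p_hat > 0.5, levels = c(FALSE, TRUE)),
  actual    = factor(test$fraud, levels = c(0, 1))
)

# plot(fit, pages = 1, shade = TRUE)  # smooth effects for interpretation
```

Sampling within each class keeps the fraud rate nearly identical in the training and test sets, which matters when fraudulent cases are rare.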

Equation

Formally, a GAM can be expressed as:

\[ g(\mu) = \alpha + s_1(X_1) + s_2(X_2) + \dots + s_p(X_p) \]

where \(g(\mu)\) is the link function (e.g., logit for binary outcomes or identity for continuous outcomes), \(\alpha\) is the intercept, and \(s_j(X_j)\) are smooth functions of the predictor variables \(X_j\). This structure allows each predictor to contribute a smoothed effect to the model, capturing complex patterns in the data without obscuring the individual influence of each variable. By balancing flexibility and clarity, GAMs offer a practical alternative to fully nonparametric methods, which can become computationally intensive and difficult to interpret. The additive smooth functions \(s_j(X_j)\) are at the heart of GAMs, enabling the model to uncover nonlinear patterns while maintaining interpretability for each predictor.
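
To make the additive structure concrete, the following sketch fits a small logit-link GAM on simulated data and checks that the linear predictor \(g(\mu)\) equals the intercept plus the sum of the estimated smooth terms. The variables x1, x2, and y are made up for illustration; only mgcv's gam() and predict() are assumed.

```r
library(mgcv)

set.seed(1)
n   <- 2000
sim <- data.frame(x1 = runif(n), x2 = runif(n))
sim$y <- rbinom(n, 1, plogis(-1 + sin(2 * pi * sim$x1) + 2 * (sim$x2 - 0.5)^2))

fit <- gam(y ~ s(x1) + s(x2), family = binomial, data = sim, method = "REML")

# One column per estimated smooth s_j(X_j), evaluated at the observed data
terms_mat <- predict(fit, type = "terms")
alpha     <- attr(terms_mat, "constant")        # estimated intercept

# Rebuild g(mu) = alpha + s_1(x1) + s_2(x2) by hand
eta_sum  <- as.numeric(alpha + rowSums(terms_mat))
eta_link <- as.numeric(predict(fit, type = "link"))
max_gap  <- max(abs(eta_sum - eta_link))        # should be numerically zero

# Inverting the logit link recovers the fitted probabilities mu
mu <- plogis(eta_link)
```

Because each column of terms_mat depends on a single predictor, each effect can be plotted and inspected on its own, which is the source of the interpretability discussed above.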

Assumptions

Sample Data

Analysis and Results

Data Exploration and Visualization

Dataset Description

The Fraud Detection Transactions Dataset (Ashar, 2024) is a meticulously crafted, synthetic dataset that replicates real-world financial transaction patterns, making it a robust resource for building and testing fraud detection models. Hosted on Kaggle, it is tailored for binary classification tasks, with transactions labeled as fraudulent (1) or non-fraudulent (0), and is designed to simulate the complexity of financial systems while ensuring ethical data usage by avoiding real user information. The dataset’s realistic design captures nuanced fraud patterns, such as clustered fraudulent transactions, subtle anomalies, or irregular user behaviors, providing a challenging yet representative environment for machine learning applications in anomaly detection, risk assessment, and fraud prevention.

Because the data are synthetic, they raise no privacy concerns. The dataset contains 50,000 transactions and mixes typical activity with rare fraudulent events, so class imbalance, a common challenge in fraud detection, must be addressed. Potential data quality issues, such as noise, missing values, or outliers, mirror real-world complexities and call for preprocessing steps such as data cleaning, categorical encoding, or normalization, along with modeling techniques that are robust to noise.

Key Characteristics

The dataset simulates real-world financial transaction patterns, capturing diverse user behaviors and transaction characteristics while relying on a synthetic design for ethical data usage. It includes 50,000 rows and 21 features, with the following characteristics:

  • Size and Scope: Contains 50,000 individual transactions, each labeled as either fraudulent (1) or non-fraudulent (0).

  • Features (21 total):

    • Numerical variables: transaction amounts, risk scores, balances, and other continuous measures.

    • Categorical variables: transaction types (e.g., payment, transfer, withdrawal), device types, and merchant categories.

    • Temporal variables: transaction time, day, and sequencing patterns that capture behavioral dynamics.

  • Label Distribution: Fraudulent transactions represent a small percentage of the data, reflecting the real-world class imbalance in fraud detection problems.

  • Realism: Although synthetic, the dataset mirrors real-world fraud scenarios by including behavioral signals, unusual spending patterns, and high-risk profiles.

  • Flexibility: Supports various modeling approaches, from interpretable methods (e.g., GAMs, logistic regression) to high-performance ensemble models (e.g., XGBoost).
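
The size, missingness, and label balance described above are straightforward to verify once the file is loaded. The helper below is a hypothetical sketch (describe_labels() is not part of the original analysis) and is demonstrated on a tiny made-up data frame so that it runs without the CSV.

```r
# Hypothetical helper: report size, missing values, and class proportions
describe_labels <- function(data, label_col = "Fraud_Label") {
  counts <- table(data[[label_col]])
  list(
    n_rows     = nrow(data),
    n_cols     = ncol(data),
    n_missing  = sum(is.na(data)),
    class_prop = prop.table(counts)
  )
}

# Toy data for illustration only (not taken from the Kaggle dataset)
toy <- data.frame(
  Transaction_Amount = c(12.5, 300, 45.1, NA, 980, 20),
  Fraud_Label        = c(0, 0, 0, 0, 1, 0)
)
info <- describe_labels(toy)
info$n_rows
info$class_prop

# With the real file (assuming it is in the working directory):
# describe_labels(read.csv("synthetic_fraud_dataset.csv"))
```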

Visualizations

# Load libraries
library(tidyverse)
library(janitor)
library(gt)
library(scales)

# === Load dataset ===
data_path <- "synthetic_fraud_dataset.csv"
df <- readr::read_csv(data_path, show_col_types = FALSE) |>
  clean_names()

# === Create count tables ===
tbl_type <- df |>
  count(transaction_type, name = "Count") |>
  arrange(desc(Count)) |>
  rename(Type = transaction_type)

tbl_device <- df |>
  count(device_type, name = "Count") |>
  arrange(desc(Count)) |>
  rename(Device = device_type)

tbl_merchant <- df |>
  count(merchant_category, name = "Count") |>
  arrange(desc(Count)) |>
  rename(Merchant_Category = merchant_category)

# === Blue Theme for gt Tables ===
style_blue_gt <- function(.data, title_text) {
  .data |>
    gt() |>
    tab_header(title = md(title_text)) |>
    fmt_number(columns = "Count", decimals = 0, sep_mark = ",") |>
    tab_options(
      table.font.names = "Arial",
      table.font.size  = 14,
      data_row.padding = px(6),
      heading.align    = "left",
      table.border.top.color    = "darkblue",
      table.border.top.width    = px(3),
      table.border.bottom.color = "darkblue",
      table.border.bottom.width = px(3)
    ) |>
    tab_style(
      style = list(cell_fill(color = "darkblue"),
                   cell_text(color = "white", weight = "bold")),
      locations = cells_title(groups = "title")
    ) |>
    tab_style(
      style = list(cell_fill(color = "steelblue"),
                   cell_text(color = "white", weight = "bold")),
      locations = cells_column_labels(everything())
    ) |>
    opt_row_striping() |>
    cols_align("right", columns = "Count")
}

# === Render all three blue tables ===
style_blue_gt(tbl_type, "Table 1 – Transaction Types and Counts")

Table 1 – Transaction Types and Counts

Type              Count
POS              12,549
Online           12,546
ATM Withdrawal   12,453
Bank Transfer    12,452

style_blue_gt(tbl_device, "Table 2 – Device Types and Counts")

Table 2 – Device Types and Counts

Device    Count
Tablet   16,779
Mobile   16,640
Laptop   16,581

style_blue_gt(tbl_merchant, "Table 3 – Merchant Categories and Counts")

Table 3 – Merchant Categories and Counts

Merchant_Category    Count
Clothing            10,033
Groceries           10,019
Travel              10,015
Restaurants          9,976
Electronics          9,957

Categorical Variable Count Tables

These tables display the counts for our categorical variables. While the dataset is synthetic and the categories are relatively evenly distributed, generalized additive models (GAMs) remain an appropriate analytical approach. GAMs provide the flexibility to model complex, nonlinear relationships between predictors and outcomes, accommodating both categorical and continuous variables. The even distribution of categories in the synthetic data does not compromise the validity of GAMs; it primarily affects the interpretability of specific category effects rather than the model’s overall applicability. Therefore, GAMs can still yield meaningful insights into the underlying patterns and relationships within this dataset.
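
Counts alone do not show whether fraud rates differ across categories, which is what determines whether a category effect is informative in the model. A minimal sketch of that follow-up check is given below; it uses a small made-up data frame, and the column names simply mirror the dataset's naming as an assumption.

```r
# Toy data for illustration only (not taken from the Kaggle dataset)
toy <- data.frame(
  Transaction_Type = c("POS", "POS", "Online", "Online", "Online", "ATM Withdrawal"),
  Fraud_Label      = c(0, 1, 0, 0, 1, 0)
)

# Count and observed fraud rate per category (categories sorted alphabetically)
n_by_type    <- table(toy$Transaction_Type)
rate_by_type <- tapply(toy$Fraud_Label, toy$Transaction_Type, mean)

summary_tbl <- data.frame(
  Type       = names(rate_by_type),
  Count      = as.integer(n_by_type[names(rate_by_type)]),
  Fraud_Rate = as.numeric(rate_by_type)
)
summary_tbl
```

When such variables are passed to a GAM, storing them with factor() is the usual way to have them enter as parametric category effects alongside the smooth terms.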

# Load libraries
library(ggplot2)
library(dplyr)
library(tidyr)    # For pivot_longer
library(gridExtra) # For arranging plots
#install.packages("moments") 
library(moments)   # For skewness and kurtosis
library(tidyverse)
library(lubridate)
library(patchwork)  # for arranging multiple ggplots

# Load dataset
fraud_data <- read.csv("synthetic_fraud_dataset.csv")

# Convert Timestamp to date and calculate Issuance_Year if needed
fraud_data <- fraud_data %>%
  mutate(
    Timestamp = ymd_hms(Timestamp, quiet = TRUE),  # adjust format if needed
    Transaction_Year = year(Timestamp),
    Issuance_Year = Transaction_Year - Card_Age
  ) %>%
  filter(!is.na(Card_Age))  # remove rows with NA in Card_Age

# Variables to plot (move Transaction_Amount to last)
numeric_vars <- c("Account_Balance", "Transaction_Distance", "Risk_Score", "Card_Age", "Transaction_Amount")

# Create a list to store plots
plot_list <- list()

# Generate plots and store in the list
for (var in numeric_vars) {
  # .data[[var]] replaces the deprecated aes_string() for programmatic column access
  p <- ggplot(fraud_data, aes(x = .data[[var]])) +
    geom_histogram(fill = "steelblue", color = "white", bins = 30) +
    labs(title = paste("Distribution of", var),
         x = var,
         y = "Count") +
    theme_light()

  plot_list[[var]] <- p
}

# Arrange plots in a grid: 2 plots per row
(plot_list[[1]] | plot_list[[2]]) /
(plot_list[[3]] | plot_list[[4]]) /
plot_list[[5]]  # Transaction_Amount appears last

Distribution of Numeric Variables

The transaction amount histogram shows a strongly right-skewed distribution. Most transactions involve small amounts, while a few high-value transactions form a long right tail. The histogram alone does not separate fraudulent from legitimate transactions, but it raises the possibility that fraud clusters at extreme amounts. The skewness suggests that a log transformation or nonlinear modeling (via a GAM) can help stabilize variance and capture a curved fraud risk pattern across transaction sizes.
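
The effect of a log transformation on skewness can be checked numerically. The sketch below uses simulated log-normal amounts as a stand-in for Transaction_Amount (an assumption, not the real data) and a moment-based skewness() helper written in base R so that no extra package is required.

```r
# Moment-based sample skewness: third central moment over variance^(3/2)
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

set.seed(7)
amount <- rlnorm(5000, meanlog = 4, sdlog = 1)  # simulated right-skewed amounts

skew_raw <- skewness(amount)          # strongly positive for the raw amounts
skew_log <- skewness(log1p(amount))   # much closer to zero after log1p()

c(raw = skew_raw, log1p = skew_log)
```

Using log1p() rather than log() keeps the transformation defined when an amount is exactly zero.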

ggplot(fraud_data, aes(x = as.factor(Fraud_Label), y = Risk_Score, fill = as.factor(Fraud_Label))) +
  geom_boxplot(alpha = 0.7) +
  scale_fill_manual(values = c("0" = "steelblue", "1" = "red"),
                    name = "Fraud Label",
                    labels = c("Legit", "Fraud")) +
  labs(title = "Distribution of Risk Scores by Fraud Label",
       x = "Fraud Label",
       y = "Risk Score") +
  theme_light() +
  theme(legend.position = "none")

Distribution of Risk Scores

The boxplot shows the distribution of Risk_Score for fraudulent versus legitimate transactions. Fraudulent transactions generally have higher scores, with a higher median and upper quartile, while legitimate transactions cluster at lower values. This suggests that Risk_Score is a meaningful feature for distinguishing fraud. Using a GAM, we can formally test how Risk_Score relates to fraud, capturing potential non-linear effects in the data.
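
One way such a test might look is sketched below: a single-smooth binomial GAM whose summary reports the effective degrees of freedom (edf) and an approximate p-value for the smooth. The data are simulated and the relationship between risk_score and fraud is an assumed one, so the numbers say nothing about the real dataset.

```r
library(mgcv)

set.seed(11)
n   <- 3000
sim <- data.frame(risk_score = runif(n))
# Assumed mechanism: fraud probability rises nonlinearly with the risk score
sim$fraud <- rbinom(n, 1, plogis(-3 + 5 * sim$risk_score^3))

fit <- gam(fraud ~ s(risk_score), family = binomial, data = sim, method = "REML")

# Smooth-term table: edf, reference df, chi-square statistic, approximate p-value
s_tab   <- summary(fit)$s.table
edf     <- s_tab["s(risk_score)", "edf"]
p_value <- s_tab["s(risk_score)", "p-value"]
s_tab
```

An edf close to 1 indicates an effect that is nearly linear on the logit scale, while larger values point to curvature that a plain logistic regression would miss.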

library(tidyverse)
library(lubridate)

# Load dataset
fraud_data <- read.csv("synthetic_fraud_dataset.csv")

# Convert Timestamp to date-time, derive Transaction_Year and Issuance_Year, and drop NAs
# (assumes Card_Age is measured in years; if it is recorded in months, use Card_Age / 12)
fraud_data <- fraud_data %>%
  mutate(
    Timestamp = ymd_hms(Timestamp),               # adjust if format differs
    Transaction_Year = year(Timestamp),
    Issuance_Year = Transaction_Year - Card_Age
  ) %>%
  filter(!is.na(Issuance_Year), !is.na(Card_Age))  # remove rows with NA

# Bin Issuance Year into 5-year ranges and drop unused NA factor levels
fraud_data <- fraud_data %>%
  mutate(
    Issuance_Year_Bin = cut(Issuance_Year,
                             breaks = seq(2000, 2025, by = 5),
                             right = FALSE,
                             labels = c("2000-2004","2005-2009","2010-2014","2015-2019","2020-2024"))
  ) %>%
  filter(!is.na(Issuance_Year_Bin))  # drop any rows that fall outside the bins

# Bar chart of card counts per issuance-year bin
ggplot(fraud_data, aes(x = Issuance_Year_Bin)) +
  geom_bar(fill = "steelblue", color = "white") +
  labs(title = "Card Age Distribution by Issuance Year Range",
       x = "Card Issuance Year Range",
       y = "Count") +
  theme_light()

Distribution of Card Age

Card issuance year tends to show a left-skewed distribution (equivalently, card age is right-skewed): many cards are relatively new, with fewer older cards. Older cards (e.g., issued in 2015–2017) may be more vulnerable if their security features are outdated. Newer cards (e.g., 2023–2024) might show different usage patterns, possibly more digital or mobile transactions. Peaks in certain years could reflect onboarding campaigns or fraud targeting specific cohorts. Together, these observations suggest that fraud risk may vary with card maturity; for example, new cards could face higher risk due to unfamiliar usage patterns. A GAM's smooth terms can model such non-monotonic age–fraud relationships.

library(tidyverse)
# Load dataset
fraud_data <- read.csv("synthetic_fraud_dataset.csv")
# Ensure Fraud_Label is numeric (0/1)
fraud_data <- fraud_data %>%
  mutate(Fraud_Label = as.numeric(Fraud_Label))

# Nonlinearity check: Transaction Amount vs Fraud Probability
ggplot(fraud_data, aes(x = Transaction_Amount, y = Fraud_Label)) +
  geom_smooth(method = "loess", se = FALSE, color = "darkblue") +
  labs(title = "Relationship Between Transaction Amount and Fraud Probability",
       x = "Transaction Amount",
       y = "Fraud Probability") +
  theme_light()

Non-linearity Check

The plot shows a nonlinear relationship between transaction amount and fraud probability, supporting the use of GAMs to model such effects flexibly. Because transaction amount is a key continuous predictor, this check motivates a flexible approach before the full set of variables is analyzed.
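
A visual check like this can be complemented by comparing a linear logistic regression with a GAM on the same predictor. The sketch below does so with AIC on simulated data in which fraud risk is U-shaped in the amount; both the data and that shape are assumptions made for illustration.

```r
library(mgcv)

set.seed(3)
n   <- 3000
sim <- data.frame(amount = runif(n, 0, 10))
# Assumed mechanism: risk is lowest for mid-range amounts, higher at both extremes
sim$fraud <- rbinom(n, 1, plogis(-2 + 0.15 * (sim$amount - 5)^2))

# Linear effect on the logit scale versus a smooth effect
fit_lin <- glm(fraud ~ amount, family = binomial, data = sim)
fit_gam <- gam(fraud ~ s(amount), family = binomial, data = sim, method = "REML")

# Lower AIC indicates the better trade-off between fit and complexity
aic_tab <- AIC(fit_lin, fit_gam)
aic_tab
```

If the two AIC values were close, the simpler linear term would be preferable; a clear gap in favor of the GAM is evidence that the nonlinearity is worth modeling.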

Modeling and Results

Conclusion

References

Agarwal, A., Frosst, N., Zhang, X., Caruana, R., & Hinton, G. (2021). Neural additive models: Interpretable machine learning with neural networks. Advances in Neural Information Processing Systems, 34, 4694–4706. https://arxiv.org/abs/2004.13912
Ashar, S. (2024). Fraud detection transactions dataset. Kaggle. https://www.kaggle.com/datasets/samayashar/fraud-detection-transactions-dataset
Brossart, D. F., Clay, D. L., & Willson, V. (2015). Detecting contaminated birthdates using generalized additive models. BMC Bioinformatics, 16(185), 1–9. https://doi.org/10.1186/s12859-015-0636-0
Chang, J., Guo, R., Zhao, L., & Liu, H. (2022). Interpretable graph learning with graph neural additive models. Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 118–128. https://doi.org/10.1145/3534678.3539310
Detmer, A. (2025). Ecological thresholds and generalized additive models. Journal of Ecology Research, 45(3), 215–230.
DGAM. (2021). Dynamic generalized additive models (DGAMs) for forecasting. PeerJ, 9, e10974. https://doi.org/10.7717/peerj.10974
FGAM. (2015). Functional generalized additive models. Statistica Sinica, 25(2), 533–558.
GAMformer. (2023). GAMformer: In-context learning for generalized additive models. arXiv preprint arXiv:2306.04301. https://arxiv.org/abs/2306.04301
Gam.hp. (2020). Evaluating the relative importance of predictors in generalized additive models using the gam.hp R package. Comprehensive R Archive Network (CRAN). https://cran.r-project.org/package=gam.hp
Guisan, A., Edwards, T. C., & Hastie, T. (2002). Generalized linear and generalized additive models in studies of species distributions: Setting the scene. Ecological Modelling, 157(2–3), 89–100. https://doi.org/10.1016/S0304-3800(02)00204-1
HalDa, C. (2012). Generalized linear models and generalized additive models (lecture notes, chapter 13). Department of Statistics, Carnegie Mellon University. http://www.stat.cmu.edu/~cshalizi/mreg/15/lectures/13/lecture-13.pdf
Hanagandi, S., Dhar, M., & Buescher, D. (2023). Enhancing credit card fraud detection with regularized generalized linear models: A comparative analysis of down-sampling and up-sampling techniques. International Journal of Innovative Science and Research Technology, 8(9), 1533–1539.
Hastie, T., & Tibshirani, R. (1986). Generalized additive models. Statistical Science, 1(3), 297–310. http://www.jstor.org/stable/2245459
Hastie, T., & Tibshirani, R. (1990). Generalized additive models. Chapman & Hall/CRC.
Lou, Y., Caruana, R., & Gehrke, J. (2012). Intelligible models for classification and regression. Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 150–158. https://doi.org/10.1145/2339530.2339556
Miller, D. L. (2025). Gam model – fraud detection in darknet markets using generalized additive models. Figshare. https://doi.org/10.6084/m9.figshare.28618408
Tragouda, K., Papadopoulos, T., & Stefanou, A. (2024). Identification of fraudulent financial statements through a multi-label classification approach. Intelligent Systems in Accounting, Finance and Management. https://doi.org/10.1002/isaf.225
White, L. F., Jiang, W., Ma, Y., So-Armah, K., Samet, J. H., & Cheng, D. M. (2020). Tutorial in biostatistics: The use of generalized additive models to evaluate alcohol consumption as an exposure variable. Drug and Alcohol Dependence, 209, 107944. https://doi.org/10.1016/j.drugalcdep.2020.107944
Wood, S. N. (2017). Generalized additive models: An introduction with R (2nd ed.). Chapman & Hall/CRC.
Wood, S. N. (2025). Mgcv: Mixed GAM computation vehicle with automatic smoothness estimation (R package version 1.9-1). Comprehensive R Archive Network (CRAN). https://cran.r-project.org/package=mgcv
Zhang, Y., Li, X., & Chen, W. (2025). Graph-based approaches for telecom fraud detection: A comparison with generalized additive models. Journal of Computational Intelligence in Finance, 38(2), 155–170.
Zhu, M., Gong, Y., Xiang, Y., Yu, H., & Huo, S. (2023). Utilizing GANs for fraud detection: Model training with synthetic transaction data. ResearchGate. https://www.researchgate.net/publication/373914456
Zlaoui, K. (2018). A (very) quick introduction to GAMs. Medium. https://towardsdatascience.com/a-very-quick-introduction-to-gams-64f0c1f59f92